Designing Special Post-Processing Rules for SVM-Based Chinese Word Segmentation
Authors
Abstract
We participated in the Third International Chinese Word Segmentation Bakeoff. Specifically, we evaluated our Chinese word segmenter NEUCipSeg in the closed track on all four corpora: Academia Sinica (AS), City University of Hong Kong (CITYU), Microsoft Research (MSRA), and University of Pennsylvania/University of Colorado (UPENN). The basic segmenter is built on Support Vector Machines (SVMs) and treats Chinese word segmentation as a character-based tagging problem. In addition, we propose post-processing rules designed specifically around the properties of the basic segmenter's output. Our system achieved good ranks on all four corpora.

1 SVM-based Chinese Word Segmenter
We built our segmentation system following (Xue and Shen, 2003), regarding Chinese word segmentation as a problem of character-based tagging. Instead of Maximum Entropy, we used Support Vector Machines as an alternative. SVMs are a state-of-the-art learning algorithm, owing their success mainly to their ability to control the upper bound of the generalization error and to their smooth integration with kernel methods. See (Vapnik, 1995) for details. We adopted svm-light (http://svmlight.joachims.org/) as the specific implementation of the model.

1.1 Problem Formalization
By formalizing Chinese word segmentation as a character-based tagging problem, we assigned each character to one and only one of four classes: word-prefix, word-suffix, word-stem, and single-character. For example, given the two-word sequence “东南亚 人”, the Chinese words for “Southeast Asia (东南亚) people (人)”, the character “东” is assigned to the category word-prefix, indicating the beginning of a word; “南” is assigned to the category word-stem, indicating the middle position of a word; “亚” belongs to the category word-suffix, marking the end of a word; and finally “人” is assigned to the category single-character, indicating that the character is a word by itself.
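The four-class scheme above can be illustrated with a minimal sketch that converts a segmented sentence into per-character tags and back. The function names `tag_words` and `tags_to_words` are hypothetical and not part of NEUCipSeg. The handling of ill-formed tag sequences in the decoder is also an assumption made for illustration; it is not the set of post-processing rules proposed in the paper.

```python
from typing import List, Tuple

# The four classes named in the text.
PREFIX = "word-prefix"
STEM = "word-stem"
SUFFIX = "word-suffix"
SINGLE = "single-character"


def tag_words(words: List[str]) -> List[Tuple[str, str]]:
    """Turn a list of segmented words into (character, tag) pairs."""
    tagged: List[Tuple[str, str]] = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            tagged.append((word, SINGLE))
            continue
        tagged.append((word[0], PREFIX))
        for ch in word[1:-1]:
            tagged.append((ch, STEM))
        tagged.append((word[-1], SUFFIX))
    return tagged


def tags_to_words(tagged: List[Tuple[str, str]]) -> List[str]:
    """Rebuild words from (character, tag) pairs.

    Assumption (not from the paper): an unfinished word is closed
    whenever a new word-prefix or single-character tag appears.
    """
    words: List[str] = []
    current = ""
    for ch, tag in tagged:
        if tag in (PREFIX, SINGLE) and current:
            words.append(current)  # close an unfinished word
            current = ""
        current += ch
        if tag in (SUFFIX, SINGLE):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


if __name__ == "__main__":
    pairs = tag_words(["东南亚", "人"])
    for ch, tag in pairs:
        print(ch, tag)
    print(tags_to_words(pairs))
```

In a tagging-based segmenter, the classifier predicts one such tag per character and a decoder like `tags_to_words` recovers the word boundaries from those predictions.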
1.2 Feature Templates
We utilized four of the five basic feature templates suggested in (Low et al., 2005), described as
Similar resources
Context-Based Chinese Word Segmentation using SVM Machine-Learning Algorithm without Dictionary Support
This paper presents a new machine-learning Chinese word segmentation (CWS) approach, which defines CWS as a break-point classification problem; the break point is the boundary of two subsequent words. Further, this paper exploits a support vector machine (SVM) classifier, which learns the segmentation rules of the Chinese language from a context model of break points in a corpus. Additionally, ...
Do Chinese Readers Follow the National Standard Rules for Word Segmentation during Reading?
We conducted a preliminary study to examine whether Chinese readers' spontaneous word segmentation processing is consistent with the national standard rules of word segmentation based on the Contemporary Chinese language word segmentation specification for information processing (CCLWSSIP). Participants were asked to segment Chinese sentences into individual words according to their prior knowl...
Word Segmenter for Chinese Micro-blogging Text Segmentation - Report for CIPS-SIGHAN'2014 Bakeoff
This paper presents our system for the CIPS-SIGHAN-2014 bakeoff task of Chinese word segmentation. This system adopts a character-based joint approach, which combines a character-based generative model and a character-based discriminative model. To further improve cross-domain performance, an external dictionary is employed. In addition, pre-processing and post-processing rules are utilize...
High OOV-Recall Chinese Word Segmenter
For the Chinese word segmentation competition held at the first CIPS-SIGHAN joint conference, we applied a subword-based word segmenter using CRFs and extended the segmenter with OOV words recognized by Accessor Variety. Moreover, we proposed several post-processing rules to improve the performance. Our system achieved promising OOV recall among all the participants.
ISCAS: A Cascaded Approach for CIPS-SIGHAN Micro-Blog Word Segmentation Bakeoff 2012 Track
State-of-the-art Chinese word segmentation systems have achieved high performance on well-formed long documents. However, segmentation of micro-blog text is difficult due to the noise problem and the OOV problem. In this paper, we present a Chinese Micro-Blog Segmentation system for the CIPS-SIGHAN Word Segmentation Bakeoff 2012 track. The proposed system adopts a cascaded approach which conta...